Assignment 4 DSEM Report¶

The assignment has the following requirements:

  1. What is the question?

  2. What did you do?

  3. How well did it work?

  4. What did you learn?

Abstract¶

The task of the assignment is to comibine all the teachings from the previous assignments and combine them in one cohesive code article.

Data¶

The data that is being used in this article is from IBM HR Analytics present on kaggle. You can use this following link to download the data from my repository.

The data has following columns:

  • Age
  • Attrition
  • BusinessTravel
  • DailyRate
  • Department
  • DistanceFromHome
  • Eduction
  • EducationField
  • EmployeeCount
  • EmployeeNumber
  • EnvironmentSatisfaction
  • Gender
  • HourlyRate
  • JobInvolvement
  • JobLevel
  • JobRole
  • JobSatisfaction
  • MaritalStatus
  • MonthlyIncome
  • MonthlyRate
  • NumCompaniesWorked
  • Over18
  • Overtime
  • PercentSalaryHike
  • StandardHours
  • StockOptionLevel
  • TotalWorkingYears
  • TrainingTimesLastYear
  • WorkLifeBalance
  • YearsAtCompany
  • YearsInCurrentRole
  • YearsSinceLastPromotion
  • YearsWithCurrManager

Setup Stage¶

In [ ]:
!pip install h2o
!pip install shap
!pip install seaborn
import seaborn as sns
import shap
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import h2o
from h2o.automl import H2OAutoML
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch


from sklearn.model_selection import train_test_split
import seaborn as sns
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting h2o
  Downloading h2o-3.40.0.3.tar.gz (177.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 177.6/177.6 MB 4.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: requests in /usr/local/lib/python3.9/dist-packages (from h2o) (2.27.1)
Requirement already satisfied: tabulate in /usr/local/lib/python3.9/dist-packages (from h2o) (0.8.10)
Requirement already satisfied: future in /usr/local/lib/python3.9/dist-packages (from h2o) (0.18.3)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.9/dist-packages (from requests->h2o) (1.26.15)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.9/dist-packages (from requests->h2o) (2.0.12)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/dist-packages (from requests->h2o) (2022.12.7)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests->h2o) (3.4)
Building wheels for collected packages: h2o
  Building wheel for h2o (setup.py) ... done
  Created wheel for h2o: filename=h2o-3.40.0.3-py2.py3-none-any.whl size=177694727 sha256=1e014a0c4351d1e2385a0d1f73c0633e73365ae77508703b4958c07a319f2129
  Stored in directory: /root/.cache/pip/wheels/9a/54/b6/c9ab3e71309ef0000bbe39e715020dc151bbfc557784b7f4c9
Successfully built h2o
Installing collected packages: h2o
Successfully installed h2o-3.40.0.3
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting shap
  Downloading shap-0.41.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (572 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 572.4/572.4 KB 10.3 MB/s eta 0:00:00
Requirement already satisfied: numpy in /usr/local/lib/python3.9/dist-packages (from shap) (1.22.4)
Requirement already satisfied: tqdm>4.25.0 in /usr/local/lib/python3.9/dist-packages (from shap) (4.65.0)
Requirement already satisfied: numba in /usr/local/lib/python3.9/dist-packages (from shap) (0.56.4)
Requirement already satisfied: packaging>20.9 in /usr/local/lib/python3.9/dist-packages (from shap) (23.0)
Requirement already satisfied: scipy in /usr/local/lib/python3.9/dist-packages (from shap) (1.10.1)
Collecting slicer==0.0.7
  Downloading slicer-0.0.7-py3-none-any.whl (14 kB)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.9/dist-packages (from shap) (2.2.1)
Requirement already satisfied: pandas in /usr/local/lib/python3.9/dist-packages (from shap) (1.4.4)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.9/dist-packages (from shap) (1.2.2)
Requirement already satisfied: llvmlite<0.40,>=0.39.0dev0 in /usr/local/lib/python3.9/dist-packages (from numba->shap) (0.39.1)
Requirement already satisfied: setuptools in /usr/local/lib/python3.9/dist-packages (from numba->shap) (67.6.1)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.9/dist-packages (from pandas->shap) (2022.7.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.9/dist-packages (from pandas->shap) (2.8.2)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.9/dist-packages (from scikit-learn->shap) (1.1.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.9/dist-packages (from scikit-learn->shap) (3.1.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.9/dist-packages (from python-dateutil>=2.8.1->pandas->shap) (1.16.0)
Installing collected packages: slicer, shap
Successfully installed shap-0.41.0 slicer-0.0.7
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: seaborn in /usr/local/lib/python3.9/dist-packages (0.12.2)
Requirement already satisfied: pandas>=0.25 in /usr/local/lib/python3.9/dist-packages (from seaborn) (1.4.4)
Requirement already satisfied: matplotlib!=3.6.1,>=3.1 in /usr/local/lib/python3.9/dist-packages (from seaborn) (3.7.1)
Requirement already satisfied: numpy!=1.24.0,>=1.17 in /usr/local/lib/python3.9/dist-packages (from seaborn) (1.22.4)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.9/dist-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (0.11.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.9/dist-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (3.0.9)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.9/dist-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (23.0)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.9/dist-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (8.4.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.9/dist-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.4.4)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.9/dist-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (2.8.2)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.9/dist-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.0.7)
Requirement already satisfied: importlib-resources>=3.2.0 in /usr/local/lib/python3.9/dist-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (5.12.0)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.9/dist-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (4.39.3)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.9/dist-packages (from pandas>=0.25->seaborn) (2022.7.1)
Requirement already satisfied: zipp>=3.1.0 in /usr/local/lib/python3.9/dist-packages (from importlib-resources>=3.2.0->matplotlib!=3.6.1,>=3.1->seaborn) (3.15.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.9/dist-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.1->seaborn) (1.16.0)

Data Cleaning¶

In [ ]:
### Reading data from the github repository
data = pd.read_csv('https://raw.githubusercontent.com/TarushS-1996/DataScience_001067923/main/IBMHRAttritionDataset.csv')
### getting data type 
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                  1470 non-null   int64 
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   int64 
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64 
 19  MonthlyRate               1470 non-null   int64 
 20  NumCompaniesWorked        1470 non-null   int64 
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64 
 24  PerformanceRating         1470 non-null   int64 
 25  RelationshipSatisfaction  1470 non-null   int64 
 26  StandardHours             1470 non-null   int64 
 27  StockOptionLevel          1470 non-null   int64 
 28  TotalWorkingYears         1470 non-null   int64 
 29  TrainingTimesLastYear     1470 non-null   int64 
 30  WorkLifeBalance           1470 non-null   int64 
 31  YearsAtCompany            1470 non-null   int64 
 32  YearsInCurrentRole        1470 non-null   int64 
 33  YearsSinceLastPromotion   1470 non-null   int64 
 34  YearsWithCurrManager      1470 non-null   int64 
dtypes: int64(26), object(9)
memory usage: 402.1+ KB

Plotting the data¶

In [ ]:
### Checking if data is null or not
data.isnull().sum()

### Encoding data as 1 and 0 without increasing the dimensionality of the database
one_hot = {'Yes': 1, 'No': 0, 'Y':1, 'N':0, 'Male': 0, 'Female': 1}
data.Attrition = [one_hot[item] for item in data.Attrition]
data.OverTime = [one_hot[item] for item in data.OverTime]
data.Over18 = [one_hot[item] for item in data.Over18]
data.Gender = [one_hot[item] for item in data.Gender]

### Using pd.get_dummies() to create one-hot encofidng where data type was Object
data = pd.get_dummies(data, columns = ['BusinessTravel', 'Department', 'MaritalStatus', 'EducationField'])
data = data.drop(['EmployeeCount', 'HourlyRate', 'DailyRate', 'Over18', 'StandardHours', 'JobRole', 'EmployeeNumber'], axis = 1)

### Specifying data type as int64
data = data.astype('int64')
### Making sure data type is correctly changed
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 39 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Age                                1470 non-null   int64
 1   Attrition                          1470 non-null   int64
 2   DistanceFromHome                   1470 non-null   int64
 3   Education                          1470 non-null   int64
 4   EnvironmentSatisfaction            1470 non-null   int64
 5   Gender                             1470 non-null   int64
 6   JobInvolvement                     1470 non-null   int64
 7   JobLevel                           1470 non-null   int64
 8   JobSatisfaction                    1470 non-null   int64
 9   MonthlyIncome                      1470 non-null   int64
 10  MonthlyRate                        1470 non-null   int64
 11  NumCompaniesWorked                 1470 non-null   int64
 12  OverTime                           1470 non-null   int64
 13  PercentSalaryHike                  1470 non-null   int64
 14  PerformanceRating                  1470 non-null   int64
 15  RelationshipSatisfaction           1470 non-null   int64
 16  StockOptionLevel                   1470 non-null   int64
 17  TotalWorkingYears                  1470 non-null   int64
 18  TrainingTimesLastYear              1470 non-null   int64
 19  WorkLifeBalance                    1470 non-null   int64
 20  YearsAtCompany                     1470 non-null   int64
 21  YearsInCurrentRole                 1470 non-null   int64
 22  YearsSinceLastPromotion            1470 non-null   int64
 23  YearsWithCurrManager               1470 non-null   int64
 24  BusinessTravel_Non-Travel          1470 non-null   int64
 25  BusinessTravel_Travel_Frequently   1470 non-null   int64
 26  BusinessTravel_Travel_Rarely       1470 non-null   int64
 27  Department_Human Resources         1470 non-null   int64
 28  Department_Research & Development  1470 non-null   int64
 29  Department_Sales                   1470 non-null   int64
 30  MaritalStatus_Divorced             1470 non-null   int64
 31  MaritalStatus_Married              1470 non-null   int64
 32  MaritalStatus_Single               1470 non-null   int64
 33  EducationField_Human Resources     1470 non-null   int64
 34  EducationField_Life Sciences       1470 non-null   int64
 35  EducationField_Marketing           1470 non-null   int64
 36  EducationField_Medical             1470 non-null   int64
 37  EducationField_Other               1470 non-null   int64
 38  EducationField_Technical Degree    1470 non-null   int64
dtypes: int64(39)
memory usage: 448.0 KB
None
In [ ]:
from statsmodels.graphics.gofplots import qqplot
data_col = data[['Age', 'DistanceFromHome', 'EnvironmentSatisfaction', 'JobSatisfaction', 'Gender', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'PerformanceRating', 'StockOptionLevel', 'TotalWorkingYears', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'Attrition']]
for c in data_col.columns[:]:
  plt.figure(figsize=(8,5))
  fig=qqplot(data_col[c],line='45',fit='True')
  plt.xticks(fontsize=13)
  plt.yticks(fontsize=13)
  plt.xlabel("Theoretical quantiles",fontsize=15)
  plt.ylabel("Sample quantiles",fontsize=15)
  plt.title("Q-Q plot of {}".format(c),fontsize=16)
  plt.grid(True)
  plt.show()
<Figure size 800x500 with 0 Axes>
<Figure size 800x500 with 0 Axes>
<Figure size 800x500 with 0 Axes>
<Figure size 800x500 with 0 Axes>
<Figure size 800x500 with 0 Axes>
<Figure size 800x500 with 0 Axes>
<Figure size 800x500 with 0 Axes>
<Figure size 800x500 with 0 Axes>
<Figure size 800x500 with 0 Axes>
<Figure size 800x500 with 0 Axes>
<Figure size 800x500 with 0 Axes>
<Figure size 800x500 with 0 Axes>
<Figure size 800x500 with 0 Axes>
<Figure size 800x500 with 0 Axes>
<Figure size 800x500 with 0 Axes>
<Figure size 800x500 with 0 Axes>
<Figure size 800x500 with 0 Axes>
In [ ]:
### Getting the count of class samples and making sure there are almost equal samples for various classes.
print("Job satisfaction Count: ")
print(data.JobSatisfaction.value_counts())
Job satisfaction Count: 
4    459
3    442
1    289
2    280
Name: JobSatisfaction, dtype: int64

As we can see above Job Satisfaction has different number of examples. The model always tend to perform better with equal number of samples for training data and might suffer from overfitting.

From the above count we can see that for the classes and samples

class Sample
1.0 280
2.0 289
3.0 442
4.0 459

there is difference in values of 1.0 and 2.0 vs 3.0 and 4.0.

Thus we can assume that regularization might be able to help reduce the high variance.

On plotting correlation heatmap and viewing the values.

In [ ]:
### Setting up the figure size
plt.figure(figsize=(30,10))
### Spcifying the type of plot, data, colormap and annotation using sns library
sns.heatmap(data.corr(), annot=True, cmap='RdYlGn')
Out[ ]:
<Axes: >

Form the above correaltion matrix we can remove some of the values as they seem to be highly related to eachother and replot the heatmap of the correlation matrix

In [ ]:
data = data.drop(['JobLevel', 'MonthlyIncome', 'PerformanceRating', 'TotalWorkingYears'], axis = 1)

### Replotting the heatmap to see if any other value can be removed.
plt.figure(figsize=(30,10))
sns.heatmap(data.corr(), annot=True, cmap='RdYlGn')
Out[ ]:
<Axes: >

Checking the data for outliers and scaling¶

In [ ]:
plt.figure(figsize=(50, 20))
sns.boxplot(data=data)
Out[ ]:
<Axes: >

As we can see, the data needs normalization

In [ ]:
from sklearn import preprocessing

dataScaled = pd.get_dummies(data)
min_max_scalar = preprocessing.MinMaxScaler()


xMR = dataScaled[['MonthlyRate']].values.astype(int)
xMRS = min_max_scalar.fit_transform(xMR)
dataScaled['MonthlyRate'] = pd.DataFrame(xMRS)

xA = dataScaled[['Age']].values.astype(int)
xAS = min_max_scalar.fit_transform(xA)
dataScaled['Age'] = pd.DataFrame(xAS)

xD = dataScaled[['DistanceFromHome']].values.astype(int)
xDS = min_max_scalar.fit_transform(xD)
dataScaled['DistanceFromHome'] = pd.DataFrame(xDS)

plt.figure(figsize=(50, 20))
sns.boxplot(data=dataScaled)
Out[ ]:
<Axes: >

As we can see our current data has outliers as well. This might have some impact on model performance. We could use regularization for our model

In [ ]:
data = dataScaled

Linear regression¶

Now we can fit the data to a linear regression model and predict for job satisfaction rating of a company.

To do this we will speciy the x and y values and for x value we will drop the column that is our y.

In [ ]:
### Sepcifying the y value
y = data.Attrition

### Specifying the x value and dropping JobSatisfcation from data.
x = data.drop(['Attrition'], axis = 1)

### Checking if x has the right columns present
print(x.columns)
### Checking if y value is correctly selected
print(y)
Index(['Age', 'DistanceFromHome', 'Education', 'EnvironmentSatisfaction',
       'Gender', 'JobInvolvement', 'JobSatisfaction', 'MonthlyRate',
       'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike',
       'RelationshipSatisfaction', 'StockOptionLevel', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager',
       'BusinessTravel_Non-Travel', 'BusinessTravel_Travel_Frequently',
       'BusinessTravel_Travel_Rarely', 'Department_Human Resources',
       'Department_Research & Development', 'Department_Sales',
       'MaritalStatus_Divorced', 'MaritalStatus_Married',
       'MaritalStatus_Single', 'EducationField_Human Resources',
       'EducationField_Life Sciences', 'EducationField_Marketing',
       'EducationField_Medical', 'EducationField_Other',
       'EducationField_Technical Degree'],
      dtype='object')
0       1
1       0
2       1
3       0
4       0
       ..
1465    0
1466    0
1467    0
1468    0
1469    0
Name: Attrition, Length: 1470, dtype: int64

OLS Summary¶

Now to better understand if the x values and y values don't have a null hypothesis, we can use OLS Summary.

Note: Null hypothesis is a test to determine if the X and Y value share some sort of relation between each other.

To determine this we will look at F-Statistic of the OLS Summary. If the value is 0 then there is no relationship between the X and Y value. If the value is greater than 0 it means there is a relationship. The higher the value of F-statistic, the stronger the relationship between the X and Y value.

In [ ]:
### Splitting the data into train and test sets for both x and y value. SPecifying the shuffle (Shuffle allows for mixing data) and train size
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,shuffle = True)
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

x_value = data.drop(['Attrition'], axis = 1)
linear_model = sm.OLS(data.JobSatisfaction, x_value).fit()
print(linear_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        JobSatisfaction   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.430e+29
Date:                Sun, 09 Apr 2023   Prob (F-statistic):               0.00
Time:                        22:47:15   Log-Likelihood:                 45099.
No. Observations:                1470   AIC:                        -9.014e+04
Df Residuals:                    1439   BIC:                        -8.997e+04
Df Model:                          30                                         
Covariance Type:            nonrobust                                         
=====================================================================================================
                                        coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
Age                                4.892e-16   1.61e-15      0.303      0.762   -2.68e-15    3.66e-15
DistanceFromHome                  -9.116e-16   1.05e-15     -0.867      0.386   -2.97e-15    1.15e-15
Education                         -1.414e-16   3.06e-16     -0.462      0.644   -7.42e-16    4.59e-16
EnvironmentSatisfaction            2.544e-16   2.79e-16      0.911      0.363   -2.94e-16    8.02e-16
Gender                            -2.828e-16   6.23e-16     -0.454      0.650   -1.51e-15     9.4e-16
JobInvolvement                     1.587e-16   4.29e-16      0.370      0.711   -6.83e-16       1e-15
JobSatisfaction                       1.0000   2.77e-16   3.62e+15      0.000       1.000       1.000
MonthlyRate                        3.287e-16   1.07e-15      0.308      0.758   -1.76e-15    2.42e-15
NumCompaniesWorked                -5.941e-17   1.32e-16     -0.449      0.654   -3.19e-16       2e-16
OverTime                          -1.154e-16   6.81e-16     -0.169      0.865   -1.45e-15    1.22e-15
PercentSalaryHike                 -4.344e-16   8.33e-17     -5.216      0.000   -5.98e-16   -2.71e-16
RelationshipSatisfaction           2.344e-16   2.83e-16      0.828      0.408   -3.21e-16     7.9e-16
StockOptionLevel                   8.674e-17   4.89e-16      0.177      0.859   -8.72e-16    1.05e-15
TrainingTimesLastYear             -2.248e-16   2.38e-16     -0.945      0.345   -6.91e-16    2.42e-16
WorkLifeBalance                   -1.592e-16   4.33e-16     -0.368      0.713   -1.01e-15     6.9e-16
YearsAtCompany                    -2.764e-16   9.63e-17     -2.870      0.004   -4.65e-16   -8.75e-17
YearsInCurrentRole                -3.057e-17   1.38e-16     -0.222      0.825   -3.01e-16     2.4e-16
YearsSinceLastPromotion           -2.627e-16   1.22e-16     -2.160      0.031   -5.01e-16   -2.41e-17
YearsWithCurrManager              -1.108e-16   1.41e-16     -0.786      0.432   -3.87e-16    1.66e-16
BusinessTravel_Non-Travel         -3.697e-17   1.12e-15     -0.033      0.974   -2.24e-15    2.17e-15
BusinessTravel_Travel_Frequently  -2.272e-16   1.02e-15     -0.222      0.824   -2.23e-15    1.78e-15
BusinessTravel_Travel_Rarely       -1.37e-16   9.14e-16     -0.150      0.881   -1.93e-15    1.66e-15
Department_Human Resources        -2.511e-16   1.64e-15     -0.153      0.878   -3.47e-15    2.96e-15
Department_Research & Development  2.901e-16   1.06e-15      0.274      0.784   -1.79e-15    2.37e-15
Department_Sales                   3.469e-17   1.13e-15      0.031      0.975   -2.17e-15    2.24e-15
MaritalStatus_Divorced             -2.09e-16   1.06e-15     -0.197      0.844   -2.29e-15    1.88e-15
MaritalStatus_Married              3.452e-16   9.37e-16      0.368      0.713   -1.49e-15    2.18e-15
MaritalStatus_Single               5.109e-16   1.01e-15      0.507      0.612   -1.46e-15    2.49e-15
EducationField_Human Resources    -1.926e-16   2.53e-15     -0.076      0.939   -5.15e-15    4.77e-15
EducationField_Life Sciences       3.454e-16   8.06e-16      0.429      0.668   -1.24e-15    1.93e-15
EducationField_Marketing           4.432e-16   1.18e-15      0.375      0.708   -1.88e-15    2.76e-15
EducationField_Medical            -6.037e-16   8.52e-16     -0.709      0.478   -2.27e-15    1.07e-15
EducationField_Other              -4.372e-16   1.31e-15     -0.335      0.738      -3e-15    2.12e-15
EducationField_Technical Degree   -1.394e-16    1.1e-15     -0.127      0.899   -2.29e-15    2.01e-15
==============================================================================
Omnibus:                      176.918   Durbin-Watson:                   0.159
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              252.521
Skew:                           0.894   Prob(JB):                     1.46e-55
Kurtosis:                       3.964   Cond. No.                     1.11e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.86e-27. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

What is the question?¶

In our dataset we are trying to predict the job satisfaction level based on the input values. For this we can use Null Hypothesis to state if there is a relationship between the input and output variables.

As we can see above, the F-Statistics value is not 0 and thus we can safely assume that there is a relationship the X and Y value.


Note: F-Statistic is a way to check if there is a relationship between X-value and Y-value. If the value of F-Statistic is 0, then there is no relationship between the variables. The higher the number the more significant the realtionship is.

Further we can also look at T-Statistics values and see that some of the values imapact the values in postivie and negative way significantly. This can be further analysed using SHAP analysis.

For the next step, we will be using a simple Linear regressor as our model and perform SHAP analysis on.

Linear regression¶

In [ ]:
### For analysis of the model's performance, we will include mean_squared_error
from sklearn.metrics import mean_squared_error

### We fit the train values for x and y to our linear model
Linear_regression = LinearRegression().fit(x_train, y_train)
### Getting the pr
y_prediction = Linear_regression.predict(x_test)
### Rounding up the predicted values. 
y_prediction = np.rint(y_prediction)
### Seperating the coeffients that are used for the prediction
coeff = Linear_regression.coef_

###Placing the data in a dataframe for further analysis.
analysis = pd.DataFrame()
analysis['Y_Predictions'] = y_prediction
analysis['Y_actual'] = y_test
analysis = analysis.dropna()
print(analysis)
#sns.residplot(x = analysis['Y_actual'], y = analysis['Y_Predictions'])
pad = pd.DataFrame()
### placing the coefficients and their coulmn names respectivley in the dataframe and printing out the mean squared error and r squared for model performance
pad['ColumnNames'] = x_train.columns
pad['Coefficients'] = coeff
print("The mean squared error is: {}".format(mean_squared_error(analysis['Y_actual'], analysis['Y_Predictions'])))
print("The R^2 score on train data is: {}".format(Linear_regression.score(x_train, y_train)))
print(pad)

### Plotting the coefficients
pad.plot.bar()
     Y_Predictions  Y_actual
0              0.0       1.0
4              0.0       0.0
8              0.0       0.0
12            -0.0       0.0
15             0.0       0.0
..             ...       ...
351            0.0       0.0
353            0.0       0.0
354            0.0       0.0
355            0.0       0.0
358            0.0       0.0

[90 rows x 2 columns]
The mean squared error is: 0.2111111111111111
The R^2 score on train data is: 0.24095264654344173
                          ColumnNames  Coefficients
0                                 Age     -0.306338
1                    DistanceFromHome      0.111671
2                           Education     -0.003506
3             EnvironmentSatisfaction     -0.042771
4                              Gender     -0.028894
5                      JobInvolvement     -0.050978
6                     JobSatisfaction     -0.046996
7                         MonthlyRate     -0.009996
8                  NumCompaniesWorked      0.014952
9                            OverTime      0.207308
10                  PercentSalaryHike     -0.000959
11           RelationshipSatisfaction     -0.021432
12                   StockOptionLevel     -0.024905
13              TrainingTimesLastYear     -0.006229
14                    WorkLifeBalance     -0.021587
15                     YearsAtCompany      0.003610
16                 YearsInCurrentRole     -0.010255
17            YearsSinceLastPromotion      0.012735
18               YearsWithCurrManager     -0.012513
19          BusinessTravel_Non-Travel     -0.069484
20   BusinessTravel_Travel_Frequently      0.083898
21       BusinessTravel_Travel_Rarely     -0.014414
22         Department_Human Resources      0.015890
23  Department_Research & Development     -0.029488
24                   Department_Sales      0.013597
25             MaritalStatus_Divorced     -0.036949
26              MaritalStatus_Married     -0.029951
27               MaritalStatus_Single      0.066900
28     EducationField_Human Resources      0.016492
29       EducationField_Life Sciences     -0.026858
30           EducationField_Marketing      0.008910
31             EducationField_Medical     -0.045314
32               EducationField_Other     -0.050188
33    EducationField_Technical Degree      0.096957
Out[ ]:
<Axes: >

Shap analysis linear model¶

SHAP (SHapley Additive exPlanations) analysis is a method for interpreting machine learning models. It is used to explain the output of a model by computing the contribution of each feature to the prediction. SHAP values provide a way to estimate the importance of each feature in the model's output for a particular instance.

For analysis, we will plot summaries of the model, use waterfall analysis to determine the importance of features and their impact on the final prediction, use feature importance for all the values of the model, use dependance plots to show how values are dependant on eachother and use heatmap to see contribution of each feature to the final prediction made by the model for a hetter undestanding

In [ ]:
### First we pass our model to the shap explainer to get shaply values of the predictions
linear_explainer_shap = shap.LinearExplainer(Linear_regression, x_train)
### Next to that same explainer we pass the test values and see why the model reached that conclusion and what all values are impacting those conclusions. 
shap_values_linear_regression = linear_explainer_shap(x_test)

### We get the summary of the model's predictions and as a bar plot 
shap.summary_plot(shap_values_linear_regression, x_test, plot_type = 'bar', max_display=14)

From above, we can see that the distribution of shaply values is more for YearsWithCurrManager. However, to determine the positive and negative importance of the values we will not specify the plot type as bar and just see the summary of the shap values.

In [ ]:
shap.summary_plot(shap_values_linear_regression, x_test, max_display = 14)
No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored

But For the same parameter i.e YearsWithCurrManager, if the value becomes lower, it impacts the prediction of the model more negatively than in a positive manner.

We can see that in the graph, the red dots indicate high impact and blue showcase less impact. The direction from the mean line in the center indicates type of impact on the model's output value.

The only value that seems to have a positive impact the higher it become is YearsAtCompany.

Waterfall analysis:¶

In [ ]:
print("Waterfall plot for linear regression")
### Here we specify the plot tyle and specify one instance we want the plot to do analysis on.
shap.plots.waterfall(shap_values_linear_regression[10], max_display = 14)
Waterfall plot for linear regression

Now where the above summary plot gave the general distribution of shap valuese, the waterfall plot is used to see for one single instance of the data which value has a positive and negative impact on the output value. This is better to further analyze variable importance at per instance level.

Feature importance plot:¶

Feature importance plots here are unlike the SHAP waterfall plot, which analyzes the contributions of individual features to a single prediction or instance, the SHAP feature importance plot present below analyzes the importance of each feature across all instances in the dataset. The plot displays a bar chart of the feature importance values, with the most important features listed at the top.

In [ ]:
shap.plots.bar(shap_values_linear_regression)

SHAP Dependance plots¶

In [ ]:
### To get dependance plots for all the columns present in the database, we iterate over the columns and priovide the dependance plots.

for i in x_train.columns:
  shap.dependence_plot(i, shap_values_linear_regression.values, x_test)

SHAP dependence plots are used to visualize how the value of a particular feature affects the model's predictions. These plots can help you understand the relationship between a feature and the target variable, and how that relationship changes based on the values of other features in the dataset.

The SHAP dependence plot displays a scatter plot of the feature values and the corresponding SHAP values for each instance in the dataset. The plot shows how the feature value and the SHAP value are related, with the color of each point representing the value of another feature in the dataset.

Heatmap of the shap values¶

SHAP heatmap is useful for visualizing how different features interact with each other and how they contribute to the model's output. The heatmap can help you identify which features are most important for the model's predictions, and how those features change across different observations. This can be particularly useful in identifying patterns and trends in your data, and in understanding how your model is making decisions.

In [ ]:
shap.plots.heatmap(shap_values_linear_regression)

From above we can see for individual distribututions how YearsWithCurrManager varies over various observation having both positive and negative impact on the final output of the model.

Decision Tree¶

In [ ]:
from sklearn import tree
import graphviz

### We initially specify the max depth the model is allowed to create a tree for. 
regressor = tree.DecisionTreeClassifier(random_state=0, max_depth=4)
regressor = regressor.fit(x_train, y_train)
predictions = regressor.predict(x_test)
### Again we round up the values.
predictions = np.rint(predictions)
analysis = pd.DataFrame()
analysis['Y_Predictions'] = predictions
analysis['Y_actual'] = y_test
analysis = analysis.dropna()
print(analysis)
#sns.residplot(x = analysis['Y_actual'], y = analysis['Y_Predictions'])
print("The mean squared error is: {}".format(mean_squared_error(analysis['Y_actual'], analysis['Y_Predictions'])))
print("The R^2 score on train data is: {}".format(Linear_regression.score(x_train, y_train)))

### Visualizing the decision tree graph.
#dot_data = tree.export_graphviz(regressor, out_file=None, feature_names=x_train.columns, class_names=y_train.columns)
dot_data = tree.export_graphviz(regressor, out_file=None, feature_names=list(x_train.columns),  filled=True)
graph = graphviz.Source(dot_data)
graph
     Y_Predictions  Y_actual
0              0.0       1.0
4              0.0       0.0
8              0.0       0.0
12             0.0       0.0
15             0.0       0.0
..             ...       ...
351            0.0       0.0
353            0.0       0.0
354            0.0       0.0
355            0.0       0.0
358            0.0       0.0

[90 rows x 2 columns]
The mean squared error is: 0.24444444444444444
The R^2 score on train data is: 0.24095264654344173
Out[ ]:
Tree 0 OverTime <= 0.5 gini = 0.275 samples = 1102 value = [921, 181] 1 Age <= 0.083 gini = 0.197 samples = 802 value = [713, 89] 0->1 True 16 Age <= 0.417 gini = 0.425 samples = 300 value = [208, 92] 0->16 False 2 EducationField_Medical <= 0.5 gini = 0.497 samples = 24 value = [13, 11] 1->2 9 StockOptionLevel <= 0.5 gini = 0.18 samples = 778 value = [700, 78] 1->9 3 RelationshipSatisfaction <= 3.5 gini = 0.48 samples = 15 value = [6, 9] 2->3 6 RelationshipSatisfaction <= 3.5 gini = 0.346 samples = 9 value = [7, 2] 2->6 4 gini = 0.32 samples = 10 value = [2, 8] 3->4 5 gini = 0.32 samples = 5 value = [4, 1] 3->5 7 gini = 0.0 samples = 6 value = [6, 0] 6->7 8 gini = 0.444 samples = 3 value = [1, 2] 6->8 10 JobSatisfaction <= 1.5 gini = 0.262 samples = 323 value = [273, 50] 9->10 13 YearsAtCompany <= 38.0 gini = 0.116 samples = 455 value = [427, 28] 9->13 11 gini = 0.451 samples = 64 value = [42, 22] 10->11 12 gini = 0.193 samples = 259 value = [231, 28] 10->12 14 gini = 0.112 samples = 454 value = [427, 27] 13->14 15 gini = 0.0 samples = 1 value = [0, 1] 13->15 17 YearsAtCompany <= 1.5 gini = 0.498 samples = 145 value = [77, 68] 16->17 24 JobSatisfaction <= 3.5 gini = 0.262 samples = 155 value = [131, 24] 16->24 18 MaritalStatus_Divorced <= 0.5 gini = 0.342 samples = 32 value = [7, 25] 17->18 21 EnvironmentSatisfaction <= 2.5 gini = 0.471 samples = 113 value = [70, 43] 17->21 19 gini = 0.245 samples = 28 value = [4, 24] 18->19 20 gini = 0.375 samples = 4 value = [3, 1] 18->20 22 gini = 0.478 samples = 38 value = [15, 23] 21->22 23 gini = 0.391 samples = 75 value = [55, 20] 21->23 25 StockOptionLevel <= 0.5 gini = 0.348 samples = 98 value = [76, 22] 24->25 28 TrainingTimesLastYear <= 5.5 gini = 0.068 samples = 57 value = [55, 2] 24->28 26 gini = 0.469 samples = 40 value = [25, 15] 25->26 27 gini = 0.212 samples = 58 value = [51, 7] 25->27 29 gini = 0.035 samples = 56 value = [55, 1] 28->29 30 gini = 0.0 samples = 1 value = [0, 1] 28->30

Shap analysis (For decision tree)¶

Similarly like above we will be doing shap analysis for our decision tree model.

In [ ]:
explainer = shap.Explainer(regressor.predict, x_train)
shap_values_decision_tree = explainer(x_train)
shap.summary_plot(shap_values_decision_tree, x_train, plot_type="bar")
shap.summary_plot(shap_values_decision_tree, x_train)
shap.plots.waterfall(shap_values_decision_tree[30], max_display=15)
Permutation explainer: 1103it [00:54, 19.55it/s]
No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored

Shap variable importance plots¶

In [ ]:
shap.plots.bar(shap_values_decision_tree)

SHAP Dependance plots¶

In [ ]:
for i in x_train.columns:
  shap.dependence_plot(i, shap_values_decision_tree.values, x_train)

Heatmap of the shap annalysis for the decsion tree¶

In [ ]:
shap.plots.heatmap(shap_values_decision_tree)

AutoML¶

In [ ]:
### We first start with initializing a cluster and shutting it down if any error is encountered
try:
  h2o.init()
except:
  print("Unexpected error occured")
  h2o.cluster().shutdown()
Checking whether there is an H2O instance running at http://localhost:54321. connected.
H2O_cluster_uptime: 2 hours 31 mins
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.40.0.3
H2O_cluster_version_age: 4 days
H2O_cluster_name: H2O_from_python_unknownUser_qs15wd
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 2.990 Gb
H2O_cluster_total_cores: 2
H2O_cluster_allowed_cores: 2
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null, "colab_language_server": "/usr/colab/bin/language_service"}
H2O_internal_security: False
Python_version: 3.9.16 final
In [ ]:
### Here we pass our data to be converted in an H2OFrame. This data type will allow for easier manipulation and passing of the data for the training purpose
df = h2o.H2OFrame(data)
### Here we split the dat into train and test and print out the dataset
df_train, df_test = df.split_frame([0.75])
print(df_train)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
     Age    Attrition    DistanceFromHome    Education    EnvironmentSatisfaction    Gender    JobInvolvement    JobSatisfaction    MonthlyRate    NumCompaniesWorked    OverTime    PercentSalaryHike    RelationshipSatisfaction    StockOptionLevel    TrainingTimesLastYear    WorkLifeBalance    YearsAtCompany    YearsInCurrentRole    YearsSinceLastPromotion    YearsWithCurrManager    BusinessTravel_Non-Travel    BusinessTravel_Travel_Frequently    BusinessTravel_Travel_Rarely    Department_Human Resources    Department_Research & Development    Department_Sales    MaritalStatus_Divorced    MaritalStatus_Married    MaritalStatus_Single    EducationField_Human Resources    EducationField_Life Sciences    EducationField_Marketing    EducationField_Medical    EducationField_Other    EducationField_Technical Degree
0.547619            1           0                    2                          2         1                 3                  4      0.698053                      8           1                   11                           1                   0                        0                  1                 6                     4                          0                       5                            0                                   0                               1                             0                                    0                   1                         0                        0                       1                                 0                               1                           0                         0                       0                                  0
0.452381            1           0.0357143            2                          4         0                 2                  3      0.0121261                     6           1                   15                           2                   0                        3                  3                 0                     0                          0                       0                            0                                   0                               1                             0                                    1                   0                         0                        0                       1                                 0                               0                           0                         0                       1                                  0
0.357143            0           0.0714286            4                          4         1                 3                  3      0.845814                      1           1                   11                           3                   0                        3                  3                 8                     7                          3                       0                            0                                   1                               0                             0                                    1                   0                         0                        1                       0                                 0                               1                           0                         0                       0                                  0
0.97619             0           0.0714286            3                          3         1                 4                  1      0.316001                      4           1                   20                           1                   3                        3                  2                 1                     0                          0                       0                            0                                   0                               1                             0                                    1                   0                         0                        1                       0                                 0                               0                           0                         1                       0                                  0
0.285714            0           0.821429             1                          4         0                 3                  3      0.451355                      1           0                   22                           2                   1                        2                  3                 1                     0                          0                       0                            0                                   0                               1                             0                                    1                   0                         1                        0                       0                                 0                               1                           0                         0                       0                                  0
0.47619             0           0.785714             3                          4         0                 2                  3      0.268741                      0           0                   21                           2                   0                        2                  3                 9                     7                          1                       8                            0                                   1                               0                             0                                    1                   0                         0                        0                       1                                 0                               1                           0                         0                       0                                  0
0.428571            0           0.928571             3                          3         0                 3                  3      0.58153                       6           0                   13                           2                   2                        3                  2                 7                     7                          7                       7                            0                                   0                               1                             0                                    1                   0                         0                        1                       0                                 0                               0                           0                         1                       0                                  0
0.380952            0           0.642857             2                          2         0                 3                  4      0.267577                      0           0                   11                           3                   1                        2                  3                 2                     2                          1                       2                            0                                   0                               1                             0                                    1                   0                         1                        0                       0                                 0                               0                           0                         1                       0                                  0
0.261905            0           0.714286             4                          2         1                 4                  1      0.325276                      1           0                   11                           3                   1                        1                  3                10                     9                          8                       8                            0                                   0                               1                             0                                    1                   0                         1                        0                       0                                 0                               1                           0                         0                       0                                  0
0.333333            0           0.142857             2                          1         0                 4                  2      0.520337                      0           1                   12                           4                   2                        5                  2                 6                     2                          0                       5                            0                                   0                               1                             0                                    1                   0                         1                        0                       0                                 0                               1                           0                         0                       0                                  0
[1091 rows x 35 columns]

In [ ]:
### Here we specify our x and y values that we want to pass. 
x = df.columns
y = 'Attrition'
x = x.remove(y)

### As out data isn't a logistic regression/binary classification we can remove the below two lines.
df_train[y] = df_train[y].asfactor()
df_test[y] = df_test[y].asfactor()


### Finally we create an instance of AutoML and specify max runtime and make sure the balance class is set to true as we had seen earlier that our number of sample of classes weren't even.
aml = H2OAutoML(max_runtime_secs=222, balance_classes=True, seed=1) 
aml.train(x = x, y = y, training_frame = df_train)
AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
Out[ ]:
Model Details
=============
H2OStackedEnsembleEstimator : Stacked Ensemble
Model Key: StackedEnsemble_BestOfFamily_4_AutoML_3_20230409_02208
Model Summary for Stacked Ensemble:
key value
Stacking strategy cross_validation
Number of base models (used / total) 5/6
# GBM base models (used / total) 1/1
# XGBoost base models (used / total) 1/1
# GLM base models (used / total) 1/1
# DeepLearning base models (used / total) 1/1
# DRF base models (used / total) 1/2
Metalearner algorithm GLM
Metalearner fold assignment scheme Random
Metalearner nfolds 5
Metalearner fold_column None
Custom metalearner hyperparameters None
ModelMetricsBinomialGLM: stackedensemble
** Reported on train data. **

MSE: 0.06708424630757806
RMSE: 0.25900626692722717
LogLoss: 0.23750592146249005
AUC: 0.9310165782739309
AUCPR: 0.8174902990490374
Gini: 0.8620331565478618
Null degrees of freedom: 1090
Residual degrees of freedom: 1085
Null deviance: 967.4113791384584
Residual deviance: 518.2379206311532
AIC: 530.2379206311532
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.3370898102777562
0 1 Error Rate
0 872.0 42.0 0.046 (42.0/914.0)
1 49.0 128.0 0.2768 (49.0/177.0)
Total 921.0 170.0 0.0834 (91.0/1091.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
max f1 0.3370898 0.7377522 125.0
max f2 0.2533374 0.7749469 166.0
max f0point5 0.4688383 0.8093797 95.0
max accuracy 0.4688383 0.9230064 95.0
max precision 0.9635126 1.0 0.0
max recall 0.0120916 1.0 383.0
max specificity 0.9635126 1.0 0.0
max absolute_mcc 0.4334891 0.6935376 100.0
max min_per_class_accuracy 0.2037646 0.8566740 194.0
max mean_per_class_accuracy 0.2533374 0.8642893 166.0
max tns 0.9635126 914.0 0.0
max fns 0.9635126 176.0 0.0
max fps 0.0014769 914.0 399.0
max tps 0.0120916 177.0 383.0
max tnr 0.9635126 1.0 0.0
max fnr 0.9635126 0.9943503 0.0
max fpr 0.0014769 1.0 399.0
max tpr 0.0120916 1.0 383.0
Gains/Lift Table: Avg response rate: 16.22 %, avg score: 16.98 %
group cumulative_data_fraction lower_threshold lift cumulative_lift response_rate score cumulative_response_rate cumulative_score capture_rate cumulative_capture_rate gain cumulative_gain kolmogorov_smirnov
1 0.0100825 0.9034193 6.1638418 6.1638418 1.0 0.9295806 1.0 0.9295806 0.0621469 0.0621469 516.3841808 516.3841808 0.0621469
2 0.0201650 0.8297520 6.1638418 6.1638418 1.0 0.8658675 1.0 0.8977240 0.0621469 0.1242938 516.3841808 516.3841808 0.1242938
3 0.0302475 0.7778185 6.1638418 6.1638418 1.0 0.8105521 1.0 0.8686667 0.0621469 0.1864407 516.3841808 516.3841808 0.1864407
4 0.0403300 0.7355400 6.1638418 6.1638418 1.0 0.7595903 1.0 0.8413976 0.0621469 0.2485876 516.3841808 516.3841808 0.2485876
5 0.0504125 0.6893542 6.1638418 6.1638418 1.0 0.7084455 1.0 0.8148072 0.0621469 0.3107345 516.3841808 516.3841808 0.3107345
6 0.1008249 0.4977528 4.8190036 5.4914227 0.7818182 0.5824157 0.8909091 0.6986114 0.2429379 0.5536723 381.9003595 449.1422702 0.5405432
7 0.1503208 0.3474040 3.0819209 4.6980502 0.5 0.4151475 0.7621951 0.6052758 0.1525424 0.7062147 208.1920904 369.8050158 0.6635451
8 0.2007333 0.2645546 1.9051875 3.9966463 0.3090909 0.3033519 0.6484018 0.5294501 0.0960452 0.8022599 90.5187468 299.6646286 0.7180148
9 0.3006416 0.1671354 0.8482351 2.9503755 0.1376147 0.2137292 0.4786585 0.4245307 0.0847458 0.8870056 -15.1764889 195.0375500 0.6999159
10 0.4005500 0.1166096 0.5089411 2.3414136 0.0825688 0.1411636 0.3798627 0.3538510 0.0508475 0.9378531 -49.1058933 134.1413593 0.6413542
11 0.5004583 0.0860542 0.2261960 1.9191449 0.0366972 0.0993268 0.3113553 0.3030394 0.0225989 0.9604520 -77.3803970 91.9144885 0.5490734
12 0.6003666 0.0601386 0.0 1.5997757 0.0 0.0712955 0.2595420 0.2644744 0.0 0.9604520 -100.0 59.9775736 0.4298174
13 0.7002750 0.0414830 0.2261960 1.4038069 0.0366972 0.0511607 0.2277487 0.2340409 0.0225989 0.9830508 -77.3803970 40.3806904 0.3375366
14 0.8001833 0.0254370 0.1130980 1.2426531 0.0183486 0.0321108 0.2016037 0.2088286 0.0112994 0.9943503 -88.6901985 24.2653102 0.2317682
15 0.9000917 0.0136238 0.0 1.1047211 0.0 0.0191969 0.1792261 0.1877798 0.0 0.9943503 -100.0 10.4721139 0.1125122
16 1.0 0.0010811 0.0565490 1.0 0.0091743 0.0082873 0.1622365 0.1698470 0.0056497 1.0 -94.3450993 0.0 0.0
ModelMetricsBinomialGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 0.09497797605451712
RMSE: 0.30818497052016847
LogLoss: 0.3252672555113068
AUC: 0.8325513975942341
AUCPR: 0.6076576875662298
Gini: 0.6651027951884683
Null degrees of freedom: 1090
Residual degrees of freedom: 1085
Null deviance: 967.6398783795136
Residual deviance: 709.7331515256715
AIC: 721.7331515256715
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.22764448562840753
0 1 Error Rate
0 785.0 129.0 0.1411 (129.0/914.0)
1 55.0 122.0 0.3107 (55.0/177.0)
Total 840.0 251.0 0.1687 (184.0/1091.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
max f1 0.2276445 0.5700935 178.0
max f2 0.1377375 0.6472197 242.0
max f0point5 0.4863310 0.6330275 78.0
max accuracy 0.4863310 0.8799267 78.0
max precision 0.9697100 1.0 0.0
max recall 0.0012486 1.0 398.0
max specificity 0.9697100 1.0 0.0
max absolute_mcc 0.3821308 0.4971459 109.0
max min_per_class_accuracy 0.1554190 0.7627119 226.0
max mean_per_class_accuracy 0.2158388 0.7776150 185.0
max tns 0.9697100 914.0 0.0
max fns 0.9697100 176.0 0.0
max fps 0.0006290 914.0 399.0
max tps 0.0012486 177.0 398.0
max tnr 0.9697100 1.0 0.0
max fnr 0.9697100 0.9943503 0.0
max fpr 0.0006290 1.0 399.0
max tpr 0.0012486 1.0 398.0
Gains/Lift Table: Avg response rate: 16.22 %, avg score: 16.20 %
group cumulative_data_fraction lower_threshold lift cumulative_lift response_rate score cumulative_response_rate cumulative_score capture_rate cumulative_capture_rate gain cumulative_gain kolmogorov_smirnov
1 0.0100825 0.8740287 6.1638418 6.1638418 1.0 0.9146462 1.0 0.9146462 0.0621469 0.0621469 516.3841808 516.3841808 0.0621469
2 0.0201650 0.7934841 5.0431433 5.6034926 0.8181818 0.8207936 0.9090909 0.8677199 0.0508475 0.1129944 404.3143297 460.3492553 0.1108062
3 0.0302475 0.7221088 5.0431433 5.4167095 0.8181818 0.7606231 0.8787879 0.8320210 0.0508475 0.1638418 404.3143297 441.6709468 0.1594654
4 0.0403300 0.6881472 4.4827940 5.1832306 0.7272727 0.7059202 0.8409091 0.8004958 0.0451977 0.2090395 348.2794042 418.3230611 0.2013809
5 0.0504125 0.6426058 3.3620955 4.8190036 0.5454545 0.6650404 0.7818182 0.7734047 0.0338983 0.2429379 236.2095532 381.9003595 0.2298088
6 0.1008249 0.4270645 3.6983051 4.2586543 0.6 0.5238979 0.6909091 0.6486513 0.1864407 0.4293785 269.8305085 325.8654340 0.3921794
7 0.1503208 0.3289637 2.0546139 3.5329337 0.3333333 0.3745490 0.5731707 0.5583981 0.1016949 0.5310734 105.4613936 253.2933719 0.4544870
8 0.2007333 0.2587095 1.9051875 3.1241390 0.3090909 0.2893493 0.5068493 0.4908288 0.0960452 0.6271186 90.5187468 212.4138999 0.5089567
9 0.3006416 0.1715995 1.0744311 2.4429861 0.1743119 0.2083196 0.3963415 0.3969462 0.1073446 0.7344633 7.4431141 144.2986082 0.5178331
10 0.4005500 0.1194211 0.7916861 2.0311058 0.1284404 0.1406818 0.3295195 0.3330267 0.0790960 0.8135593 -20.8313896 103.1105767 0.4929904
11 0.5004583 0.0872070 0.3392940 1.6933631 0.0550459 0.1014152 0.2747253 0.2867892 0.0338983 0.8474576 -66.0705956 69.3363134 0.4141972
12 0.6003666 0.0603426 0.4523921 1.4868504 0.0733945 0.0727316 0.2412214 0.2511674 0.0451977 0.8926554 -54.7607941 48.6850390 0.3488917
13 0.7002750 0.0418816 0.3958431 1.3311962 0.0642202 0.0508029 0.2159686 0.2225814 0.0395480 0.9322034 -60.4156948 33.1196202 0.2768423
14 0.8001833 0.0257670 0.2827450 1.2002899 0.0458716 0.0333123 0.1947308 0.1989498 0.0282486 0.9604520 -71.7254963 20.0289928 0.1913054
15 0.9000917 0.0143539 0.2261960 1.0921675 0.0366972 0.0199862 0.1771894 0.1790852 0.0225989 0.9830508 -77.3803970 9.2167489 0.0990246
16 1.0 0.0005999 0.1696470 1.0 0.0275229 0.0079601 0.1622365 0.1619884 0.0169492 1.0 -83.0352978 0.0 0.0
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
accuracy 0.8619171 0.0353565 0.8812785 0.8584475 0.8807340 0.8018018 0.8873239
auc 0.8309743 0.0361391 0.8771904 0.7842946 0.8139499 0.8244461 0.8549906
err 0.1380829 0.0353565 0.1187215 0.1415525 0.1192661 0.1981982 0.1126761
err_count 30.2 8.136338 26.0 31.0 26.0 44.0 24.0
f0point5 0.5980484 0.0823182 0.6521739 0.5529954 0.6418919 0.4745763 0.6686047
f1 0.6072270 0.0353989 0.6176470 0.6075949 0.59375 0.56 0.6571429
f2 0.6284139 0.0568205 0.5865922 0.6741573 0.5523256 0.6829268 0.6460674
lift_top_group 5.7512155 0.9013953 5.918919 6.6363635 6.0555553 4.2285714 5.9166665
logloss 0.3250895 0.0239453 0.2960839 0.3356675 0.3389248 0.3512427 0.3035282
max_per_class_error 0.3476986 0.1121399 0.4324324 0.2727273 0.4722222 0.2 0.3611111
--- --- --- --- --- --- --- ---
mean_per_class_error 0.2221176 0.0288116 0.2436888 0.1955034 0.2608364 0.1989305 0.211629
mse 0.0949170 0.0071164 0.0880642 0.0956056 0.0994174 0.1039190 0.0875787
null_deviance 193.52797 4.760831 199.06012 186.0061 195.38908 193.53395 193.65062
pr_auc 0.6072177 0.0929142 0.6862503 0.5429464 0.6014975 0.4929664 0.7124276
precision 0.5969939 0.114819 0.6774194 0.5217391 0.6785714 0.4307692 0.6764706
r2 0.2997116 0.0717751 0.3727874 0.2529586 0.2788899 0.2174877 0.3764347
recall 0.6523014 0.1121399 0.5675676 0.7272728 0.5277778 0.8 0.6388889
residual_deviance 141.94662 11.895875 129.68475 147.0224 147.77122 155.95175 129.30301
rmse 0.3079130 0.0115440 0.2967562 0.3092016 0.3153053 0.3223647 0.295937
specificity 0.9034634 0.0629861 0.9450549 0.8817204 0.9505494 0.8021390 0.9378531
[22 rows x 8 columns]

[tips]
Use `model.explain()` to inspect the model.
--
Use `h2o.display.toggle_user_tips()` to switch on/off this section.
In [ ]:
### Here we are printing out the leaderboard which shows which model performed best.
best_model = aml.leaderboard
print(best_model)
model_id                                                     auc    logloss     aucpr    mean_per_class_error      rmse        mse
StackedEnsemble_BestOfFamily_4_AutoML_3_20230409_02208  0.832551   0.325267  0.607658                0.225936  0.308185  0.094978
StackedEnsemble_BestOfFamily_3_AutoML_3_20230409_02208  0.83163    0.324762  0.61299                 0.252714  0.308019  0.094876
StackedEnsemble_BestOfFamily_1_AutoML_3_20230409_02208  0.83096    0.328013  0.602189                0.237047  0.31005   0.0961308
StackedEnsemble_AllModels_2_AutoML_3_20230409_02208     0.830088   0.327607  0.603268                0.248436  0.309847  0.0960054
StackedEnsemble_AllModels_1_AutoML_3_20230409_02208     0.829074   0.329436  0.593937                0.239872  0.311104  0.0967857
GLM_1_AutoML_3_20230409_02208                           0.828969   0.329588  0.601384                0.286872  0.310004  0.0961023
StackedEnsemble_BestOfFamily_5_AutoML_3_20230409_02208  0.827965   0.330358  0.592995                0.245791  0.31115   0.0968144
StackedEnsemble_BestOfFamily_2_AutoML_3_20230409_02208  0.82484    0.332599  0.590864                0.272846  0.31144   0.0969947
XGBoost_grid_1_AutoML_3_20230409_02208_model_27         0.820501   0.338062  0.552403                0.233317  0.318518  0.101454
XGBoost_grid_1_AutoML_3_20230409_02208_model_1          0.818356   0.337356  0.562359                0.260283  0.317408  0.100748
[81 rows x 7 columns]

In [ ]:
### For simple metric analysis we are here printing out the performance of the model that was at the top of our leaderboard.
model_imp = aml.leader
model_imp.model_performance(df_test)
Out[ ]:
ModelMetricsBinomialGLM: stackedensemble
** Reported on test data. **

MSE: 0.09595816938898644
RMSE: 0.3097711564832763
LogLoss: 0.33105234312506543
AUC: 0.8180250783699059
AUCPR: 0.5994292798981069
Gini: 0.6360501567398118
Null degrees of freedom: 378
Residual degrees of freedom: 373
Null deviance: 331.1824167648591
Residual deviance: 250.9376760887996
AIC: 262.9376760887996
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.3786996322299554
0 1 Error Rate
0 297.0 22.0 0.069 (22.0/319.0)
1 27.0 33.0 0.45 (27.0/60.0)
Total 324.0 55.0 0.1293 (49.0/379.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
max f1 0.3786996 0.5739130 54.0
max f2 0.1584724 0.6451613 131.0
max f0point5 0.5142324 0.6382979 31.0
max accuracy 0.5142324 0.8839050 31.0
max precision 0.9454817 1.0 0.0
max recall 0.0095000 1.0 357.0
max specificity 0.9454817 1.0 0.0
max absolute_mcc 0.4846359 0.5004756 35.0
max min_per_class_accuracy 0.1738946 0.7492163 124.0
max mean_per_class_accuracy 0.1584724 0.7683386 131.0
max tns 0.9454817 319.0 0.0
max fns 0.9454817 59.0 0.0
max fps 0.0018311 319.0 378.0
max tps 0.0095000 60.0 357.0
max tnr 0.9454817 1.0 0.0
max fnr 0.9454817 0.9833333 0.0
max fpr 0.0018311 1.0 378.0
max tpr 0.0095000 1.0 357.0
Gains/Lift Table: Avg response rate: 15.83 %, avg score: 16.86 %
group cumulative_data_fraction lower_threshold lift cumulative_lift response_rate score cumulative_response_rate cumulative_score capture_rate cumulative_capture_rate gain cumulative_gain kolmogorov_smirnov
1 0.0105541 0.8223484 6.3166667 6.3166667 1.0 0.9033537 1.0 0.9033537 0.0666667 0.0666667 531.6666667 531.6666667 0.0666667
2 0.0211082 0.7563015 6.3166667 6.3166667 1.0 0.7854710 1.0 0.8444123 0.0666667 0.1333333 531.6666667 531.6666667 0.1333333
3 0.0316623 0.7067602 4.7375 5.7902778 0.75 0.7287966 0.9166667 0.8058738 0.05 0.1833333 373.75 479.0277778 0.1801985
4 0.0422164 0.6521768 4.7375 5.5270833 0.75 0.6755749 0.875 0.7732990 0.05 0.2333333 373.75 452.7083333 0.2270637
5 0.0501319 0.6211872 4.2111111 5.3192982 0.6666667 0.6344225 0.8421053 0.7513712 0.0333333 0.2666667 321.1111111 431.9298246 0.2572623
6 0.1002639 0.4736257 3.3245614 4.3219298 0.5263158 0.5432394 0.6842105 0.6473053 0.1666667 0.4333333 232.4561404 332.1929825 0.3957158
7 0.1503958 0.3628090 2.3271930 3.6570175 0.3684211 0.4179844 0.5789474 0.5708650 0.1166667 0.55 132.7192982 265.7017544 0.4747649
8 0.2005277 0.2832859 0.6649123 2.9089912 0.1052632 0.3213881 0.4605263 0.5084957 0.0333333 0.5833333 -33.5087719 190.8991228 0.4548067
9 0.3007916 0.1953977 1.3298246 2.3826023 0.2105263 0.2333114 0.3771930 0.4167676 0.1333333 0.7166667 32.9824561 138.2602339 0.4940961
10 0.4010554 0.1315866 0.9973684 2.0362939 0.1578947 0.1572641 0.3223684 0.3518917 0.1 0.8166667 -0.2631579 103.6293860 0.4937827
11 0.5013193 0.0795640 0.4986842 1.7287719 0.0789474 0.1032970 0.2736842 0.3021728 0.05 0.8666667 -50.1315789 72.8771930 0.4340648
12 0.5989446 0.0559557 0.3414414 1.5026432 0.0540541 0.0668548 0.2378855 0.2638170 0.0333333 0.9 -65.8558559 50.2643172 0.3576803
13 0.6992084 0.0392624 0.0 1.2871698 0.0 0.0471835 0.2037736 0.2327526 0.0 0.9 -100.0 28.7169811 0.2385580
14 0.7994723 0.0234738 0.3324561 1.1674367 0.0526316 0.0302179 0.1848185 0.2073522 0.0333333 0.9333333 -66.7543860 16.7436744 0.1590387
15 0.8997361 0.0147036 0.1662281 1.0558651 0.0263158 0.0192044 0.1671554 0.1863856 0.0166667 0.95 -83.3771930 5.5865103 0.0597179
16 1.0 0.0018311 0.4986842 1.0 0.0789474 0.0087104 0.1583113 0.1685712 0.05 1.0 -50.1315789 0.0 0.0

Shap analysis (For autoML)¶

Now to perform better analysis, we can't use our leaderboard's top performer i.e. Stacked ensemble. This is becuase the ensemble is a collection of different models predicting different values and finally making a decision based on the value.

Thus we will be using GLM or Generalized Linear Model.

In [ ]:
### As discussed earlier we are specifying to use GLM model but also making sure the model that is being picked for analysis is the best model in a collection of GLMs.
model_to_explain = aml.get_best_model(algorithm='gbm')
explain = model_to_explain.explain(df_test)

Confusion Matrix

Confusion matrix shows a predicted class vs an actual class.

GBM_grid_1_AutoML_3_20230409_02208_model_14

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.2774694883568499
0 1 Error Rate
0 309.0 10.0 0.0313 (10.0/319.0)
1 29.0 31.0 0.4833 (29.0/60.0)
Total 338.0 41.0 0.1029 (39.0/379.0)

Learning Curve Plot

Learning curve plot shows the loss function/metric dependent on number of iterations or trees for tree-based algorithms. This plot can be useful for determining whether the model overfits.

Variable Importance

The variable importance plot shows the relative importance of the most important variables in the model.

SHAP Summary

SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.

Partial Dependence Plots

Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.





In [ ]:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
#glm_parameters = {"learn_rate": [i * 0.01 for i in range(1, 11)], "max_depth": list(range(2, 11)), "sample_rate": [i * 0.1 for i in range(5, 11)],"col_sample_rate": [i * 0.1 for i in range(1, 11)], "alpha": [0.05, 0.01]}
glm_parameters = {"lambda": [i * 0.01 for i in range(1, 11)], "alpha": [i*0.01 for i in range(1, 11)]}

search_criteria = {"strategy": "RandomDiscrete", "max_models": 30, "seed": 1}
#glm_grid = H2OGridSearch(model=H2OGradientBoostingEstimator, grid_id="gbm_grid_search", hyper_params=glm_parameters, search_criteria=search_criteria)
glm_grid = H2OGridSearch(H2OGeneralizedLinearEstimator('binomial'), glm_parameters)
#glm_grid.train(x=x, y=y, training_frame=df_train, ntrees=100, seed=1)

glm_grid.train(x=x, y=y, training_frame=df_train)

gbm_gridperf2 = glm_grid.get_grid(sort_by="rmse", decreasing =False)
print(gbm_gridperf2)
glm Grid Build progress: |███████████████████████████████████████████████████████| (done) 100%
Adding alpha array to hyperparameter runs slower with gridsearch. This is due to the fact that the algo has to run initialization for every alpha value. Setting the alpha array as a model parameter will skip the initialization and run faster overall.
Hyper-Parameter Search Summary: ordered by increasing rmse
     alpha    lambda    model_ids                                                           rmse
---  -------  --------  ------------------------------------------------------------------  -------------------
     0.01     0.01      Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_1    0.2983337073109158
     0.02     0.01      Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_2    0.29837778572418516
     0.03     0.01      Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_3    0.29841485181972793
     0.04     0.01      Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_4    0.29845666335716997
     0.05     0.01      Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_5    0.298498659621854
     0.06     0.01      Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_6    0.2985428135561869
     0.07     0.01      Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_7    0.2985870435248281
     0.08     0.01      Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_8    0.2986328545611587
     0.09     0.01      Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_9    0.29868835825715434
     0.1      0.01      Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_10   0.29873750363442564
---  ---      ---       ---                                                                 ---
     0.1      0.08      Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_80   0.32324533106527775
     0.05     0.1       Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_95   0.3234578927279676
     0.08     0.09      Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_88   0.3240593987620105
     0.06     0.1       Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_96   0.32440802934862745
     0.09     0.09      Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_89   0.32503769101581687
     0.07     0.1       Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_97   0.3254320639329165
     0.1      0.09      Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_90   0.3259384582347323
     0.08     0.1       Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_98   0.3265032379960393
     0.09     0.1       Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_99   0.3274818491529781
     0.1      0.1       Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_100  0.3284789500742035
[100 rows x 5 columns]

Conclusion:¶

What did you do?¶

For this assignment we performed 4 major steps

  1. Data cleaning/ Normalization
  2. Feature selection
  3. Modelling
  4. Interpretability

Data cleaning: In this we used correlation matrix to determine multicolinearity and scaling of data to make sure the model doesn't suffer with overfitting.

Feature selection: In feature selection we used the same correlation matrix to determina values that have a significant positive or negative relationship to the our output variable that is 'Attrition'.

Modelling: For modelling we used linear classifier model, decision tree and autoML to determine the best performance. For further improving the performance, we optimized the hyperparameters by using grid search that is a feature of H2O.

Interpreatability: Interpretability is a concept of understanding the model's prediction based on input. We use SHAP analysis to determine the significance of input variables and the respective output. For this we use:

  • SHAP Summary
  • SHAP Waterfall
  • SHAP Dependance plot

From above we figured out the relationship of various variables and whether they have a positive or negative impact and by what degree.

How well did it work?¶

In the above we used OLS summary to determine null hypothesis or not. As there was a significant relationship we could predict attrition from the input variables. The performance of the model can be determined by Root Mean Squared error and for our test we have used:

  1. Linear regression
  2. Decision tree
  3. AutoML

(Note: In autoML we use a collection of models and determine the best performing model using the leaderboard.)

Below are the details and performance of the training:

Model Performance (RMSE)
Linear Model 0.24095264654344173
Decision tree classifier 0.24095264654344173
AutoML (StackedEnsemble_BestOfFamily_4_AutoML_3_20230409_02208) 0.308185

From this we can see that the model performed well enough.

What did you learn?¶

From the above article we were able to learn the concepts of data clening, feature selection, variable significance, model types, performance metrics of a model and its output, optimization of a model performance and interpretting a model's output based on the given input.

These key concepts will help in further analyzing the data and also determine the best model that can work with the dataset. We can also use the teachings from the article to further analyze a model's output which will inturn help with understanding the hyperparameters and their tuning.

References

The references used for this article are as follows:

  • https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
  • https://docs.h2o.ai/h2o/latest-stable/h2o-docs/explain.html
  • https://docs.h2o.ai/h2o/latest-stable/h2o-docs/training-models.html
  • https://betterdatascience.com/lime/#:~:text=What%20is%20LIME%3F,and%20image%20classifiers%20(currently)
  • https://github.com/aiskunks/YouTube/blob/main/A_Crash_Course_in_Statistical_Learning/Model_Interpretability/SHAP%20and%20LIME%20analysis%20Walkthrough.ipynb
  • https://mljar.com/blog/feature-importance-in-random-forest/

MIT License

Copyright © 2023 Tarush Ghanshyam Singh

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.